New logger connector code #7882
Conversation
Hello @carmocca! Thanks for updating this PR.
Comment last updated at 2021-06-08 19:35:08 UTC
Codecov Report
@@            Coverage Diff            @@
##           master    #7882     +/-   ##
=========================================
- Coverage      92%      86%       -6%
=========================================
  Files         202      204        +2
  Lines       13159    13630      +471
=========================================
- Hits        12155    11788      -367
- Misses       1004     1842      +838
# reset result collection for next epoch
self.trainer.result_collection.reset(metrics=True)

def teardown(self):
Is that something we should do in the loops maybe?
Just saying, no change requested.
The trainer will call this:
def _post_dispatch(self):
    self.accelerator.post_dispatch(self)
    self.accelerator.teardown()
    self.logger_connector.teardown()
pytorch_lightning/trainer/connectors/logger_connector/logger_connector_new.py
very nice!
pytorch_lightning/trainer/connectors/logger_connector/logger_connector_new.py
LGTM! Great job @carmocca
Evaluation metric updates
"""

def prepare_eval_loop_results(self, metrics: Mapping[str, _METRIC]) -> None:
Could be moved to the utils section.
Why? It's tied to evaluation
pytorch_lightning/trainer/connectors/logger_connector/logger_connector_new.py
self.trainer.train_loop.results.cpu()
self.trainer.evaluation_loop._val_results.cpu()
self.trainer.evaluation_loop._test_results.cpu()
@awaelchli do you think each loop should run this in its own teardown, instead of the logger connector doing it?
These results are the ResultCollection, right? I think yes: if the loops own this results collection, they should handle the teardown.
Okay. Do you think I should make that change here, or will you do it once it's all merged?
Yeah we shouldn't hold this back. Added a TODO to our POC.
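To make that TODO concrete, here is a rough sketch of the direction discussed above (purely hypothetical, not code from this PR), where each loop owns its result collection and moves it to CPU in its own teardown rather than relying on the logger connector:

# Hypothetical sketch: the loop owns the ResultCollection it logs into and
# releases GPU memory itself when it finishes, instead of the LoggerConnector.
class TrainLoop:
    def __init__(self, results) -> None:
        # `results` is assumed to be the ResultCollection this loop logs into
        self.results = results

    def teardown(self) -> None:
        # move logged tensors off the GPU once the loop is done
        self.results.cpu()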
elif step is None:
    # added metrics by Lightning for convenience
    scalar_metrics['epoch'] = self.trainer.current_epoch
    step = self.trainer.global_step
There are a lot of Codecov warnings.
This is expected, considering we haven't integrated the loops to use these new files yet.
You can take a peek at #7631 for the full integration.
This PR includes a few unit tests for the code in `Result`, but `LoggerConnectorNew` is completely unused.
Nice! LGTM 😃 minor comment
pytorch_lightning/trainer/connectors/logger_connector/logger_connector_new.py
LGTM, the integration with torchmetrics seems nice and definitely clears up some of the older logging code :]
pytorch_lightning/trainer/connectors/logger_connector/logger_connector_new.py
# TODO(@carmocca): Remove `New` suffix
class LoggerConnectorNew:
Super nit: for renamings I've always been a fan of `_v2` instead of `New`, for the worst-case circumstance that there are 3 or more versions around at the same time.
Thanks for the tip! Will keep it in mind for future refactors
# added metrics by Lightning for convenience
scalar_metrics['epoch'] = self.trainer.current_epoch
step = self.trainer.global_step
n00b questions:
- why does `step` indicate we should add `epoch` to the scalar metrics?
- what happens if `epoch` was already in the scalar metrics as another value? e.g. if the user does `self.log("epoch", ...)`?
why does `step` indicate we should add `epoch` to the scalar metrics?
Didn't get this, can you rephrase?
what happens if `epoch` was already in the scalar metrics as another value?
Well, as you see here, it gets overwritten. It wouldn't be a bad idea to check whether it's there already. Will do in a future PR.
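As an illustration of that future check (a minimal sketch, not part of this PR; the helper name and signature are made up), it could warn before the convenience `epoch` key overwrites a user-logged value:

import warnings

def add_convenience_metrics(scalar_metrics: dict, current_epoch: int, global_step: int, step=None):
    # hypothetical helper mirroring the snippet above
    if step is None:
        if 'epoch' in scalar_metrics:
            warnings.warn("'epoch' was already logged by the user and will be overwritten by the trainer")
        # added metrics by Lightning for convenience
        scalar_metrics['epoch'] = current_epoch
        step = global_step
    return scalar_metrics, step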
class MetricSource(LightningEnum):
    CALLBACK = "callback"
    PBAR = "pbar"
    LOG = "log"
Are these metric sources, or are they destinations?
In the future, I can imagine we want to configure multiple loggers but want to log metrics to only some of them. How could this be extended for that use case?
Are these metric sources, or are they destinations?
Mmm, it depends on how you think about it. Do you `self.log(prog_bar=True)` into the progress bar metrics (destination), or are the progress bar metrics gathered from the `prog_bar=True` logged values (source)?
If you have an idea for a less confusing name, we can rename without issues 👍
In the future, I can imagine we want to configure multiple loggers but want to log metrics to only some of them. How could this be extended for that use case?
In that case, I think it's easier for us to filter at the logger level. `_Metadata` could have a field for the target logger.
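To sketch that extension (hypothetical, not in this PR; the `target_loggers` field and the `should_log_to` helper are made up for illustration, and the real `_Metadata` has more fields):

from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class _Metadata:
    fx: str
    name: str
    prog_bar: bool = False
    logger: bool = True
    # hypothetical field: restrict the metric to specific loggers (None means all of them)
    target_loggers: Optional[Set[str]] = None

def should_log_to(meta: _Metadata, logger_name: str) -> bool:
    # a logger receives the metric if logging is enabled and it isn't filtered out
    return meta.logger and (meta.target_loggers is None or logger_name in meta.target_loggers)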
def check_fn(v):
    if v.grad_fn is not None:
        raise MisconfigurationException(f'You returned a tensor with `grad_fn`. The extra values are {extra}')
Could you explain what this checks for? Is it a misconfiguration for users to return tensors with gradients enabled?
IIRC the previous implementation detached extras silently; however, I feel it's better for users to know they are passing tensors with graphs attached. This helps catch potential errors on the user side, and it's not much to ask the user to detach when necessary.
Do you think this constraint should be relaxed?
Not entirely sure - just noticed that quite a few of our users were relying on this (silent) behavior and will see exceptions with the latest version of the code here.
Could you share an example of what users should do now? Is the idea that they want to call `detach` on non-loss tensors?
Is the idea that they want to call `detach` on non-loss tensors?
Exactly:
def training_step(...):
    ...
    # `something_else` is considered an "extra"
    return {'loss': loss, 'something_else': something_else.detach()}
…ter) to github/third-party/PyTorchLightning/pytorch-lightning

Summary:

## OSS
Note these issues are being solved in OSS here: https://github.com/PyTorchLightning/pytorch-lightning/pull/7994/files#

## Manual
- `speed_monitor.py` - `Result.unpack_batch_size` has been removed, moved to new implementation.
- `fully_sharded.py` - There was a refactor for plugins, so updated corresponding function to keep reduced memory usage.
- `hive_writing_classy.py`, `hive_writing_faim.py`, `hive_writing_xrayvideo.py` - Same as `speed_monitor.py`.
- [Temporary] Uncommented misconfiguration exception. See Lightning-AI/pytorch-lightning#7882 (review).
- Update `TestModel` to detach appropriately.
- Manually `detach` metrics stored in ResultStore.

## Automatic
### New commit log messages
f7459f53 DeepSpeed Infinity Update (#7234)
03e7bdf8 Improve `LightningModule` hook tests (#7944)
3a0ed02b Properly handle parent modules w/ parameters in `BaseFinetuning` callback (#7931)
ce93d8bc Handle errors due to uninitailized parameters (#7642)
cca0e753 remove parsing comments (#7958)
898fb56b added on_test_start() documentation (#7962)
22d82661 Seed all workers when using DDP (#7942)
436fc53c Improve `LightningDataModule` hook test and fix `dataloader_idx` argument (#7941)
6b7b4047 deprecate hpc_load() and integrate it with restore() (#7955)
20a5e09e fix myst-parser warning blocking docs ci (#7967)
f15ea601 update chlog + legacy chpt (#7954)
59d0c656 Add dataclass support to `apply_to_collection` (#7935)
cdd01f32 LightningCLI support for argument links applied on instantiation (#7895)
6856cced Remove rank_zero_only on DataModule prepare_data (#7945)
96433d03 IPU Integration 5/5 (#7867)
42c7f272 refactor checkpoint loading for training type plugins (#7928)
ac4eb0a0 `is_overridden` improvements (#7918)
9e932f4d Delete `on_after_backward` unused argument (#7925)
8b738693 Deprecate the default `EarlyStopping` callback monitor value (#7907)
c1eac483 split `restore_training_state` into logical parts [2 / 2] (#7900)
d209b689 split `restore_training_state` into logical parts [1 / 2] (#7901)
111287b4 add pre-commit hooks (#7906)
839019a3 Remove legacy teardown check in train loop (#7917)
b45a89a2 Clean-up after logger connector redesign 2/2 (#7631)
07b69231 Remove fn check for ipu output (#7915)
580a3b5e Remove dead code (#7910)
df812398 Clean-up after logger connector redesign 1/2 (#7909)
ec4f8856 Enable logger connector re-design (#7891)
15be9865 add logger to __all__ (#6854)
6fee9262 Deprecate `LightningDataModule` lifecycle properties (#7657)
764d2c77 refactor CheckpointConnector.restore_weights (#7862)
7f4ef6d1 Fix logs overwriting issue for remote fs (#7889)
c310ce66 Logger connector re-design `_Metadata.reduce_fx` fixes. (#7890)
b214442e New logger connector code (#7882)

Reviewed By: yifuwang
Differential Revision: D29105294
fbshipit-source-id: 990b2a4a7333908d676de193f5ec930cb50b8a19
What does this PR do?
Introduces the new code for #7631
Next PRs:
cc: @SkafteNicki please review our heavy use of `Metric` in `result_new.py`
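For context, a minimal hedged example of the kind of `torchmetrics.Metric` usage this logging code is designed around (standard `self.log` of a metric object; the internals of `result_new.py` may handle it differently, and the `Accuracy()` call follows the 2021-era torchmetrics API, where newer versions require a `task` argument):

import torch
import torchmetrics
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)
        # 2021-era torchmetrics API; newer versions need task="multiclass", num_classes=2
        self.train_acc = torchmetrics.Accuracy()

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.layer(x)
        loss = torch.nn.functional.cross_entropy(logits, y)
        # update the metric state, then log the Metric object itself so the
        # result collection can take care of compute()/reset() across the epoch
        self.train_acc(logits.softmax(dim=-1), y)
        self.log('train_acc', self.train_acc, on_step=True, on_epoch=True)
        return loss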